(54) Combining frequency warping and spectral shaping in HMM based speech recognition



(57) Frequency warping approaches to speaker normalization have been proposed and evaluated on various speech
recognition tasks. In all cases, frequency warping was found to significantly improve recognition performance by
reducing the mismatch between test utterances presented to the recognizer and the speaker independent HMM model.
This invention relates to a procedure which compensates utterances by simultaneously scaling the frequency axis and
reshaping the spectral energy contour. This procedure is shown to reduce the error rate in a telephone based
connected digit recognition task by 30%.
Description

FIELD OF THE INVENTION 

This invention relates to speech recognition systems generally, and more particularly to a signal processing technique
which combines frequency warping and spectral shaping for use in hidden Markov model-based speech recognition
systems.

BACKGROUND OF THE INVENTION 


Speech recognition is a process by which an unknown speech utterance (usually in the form of a digital PCM signal) is
identified. Generally, speech recognition is performed by comparing the features of an unknown utterance to the
features of known words or word strings. 

The features of known words or word strings are determined with a process known as "training". Through training,
one or more samples of known words or strings (training speech) are examined and their features (or characteristics)
recorded as reference patterns (or recognition unit models) in a database of a speech recognizer. Typically, each
recognition unit model represents a single known word. However, recognition unit models may represent speech of other
lengths such as subwords (e.g., phones, which are the acoustic manifestation of linguistically-based phonemes).
Recognition unit models may be thought of as building blocks for words and strings of words, such as phrases or sentences.

To recognize an utterance in a process known as "testing", a speech recognizer extracts features from the utterance
to characterize it. The features of the unknown utterance are referred to as a test pattern. The recognizer then
compares combinations of one or more recognition unit models in the database to the test pattern of the unknown
utterance. A scoring technique is used to provide a relative measure of how well each combination of recognition unit
models matches the test pattern. The unknown utterance is recognized as the words associated with the combination of one
or more recognition unit models which most closely matches the unknown utterance.

Recognizers trained using both first and second order statistics (i.e., spectral means and variances) of known
speech samples are known as hidden Markov model (HMM) recognizers. Each recognition unit model in this type of
recognizer is an N-state statistical model (an HMM) which reflects these statistics. Each state of an HMM corresponds
in some sense to the statistics associated with the temporal events of samples of a known word or subword. An HMM
is characterized by a state transition matrix, A (which provides a statistical description of how new states may be
reached from old states), and an observation probability matrix, B (which provides a description of which spectral
features are likely to be observed in a given state). Scoring a test pattern reflects the probability of the occurrence of the
sequence of features of the test pattern given a particular model. Scoring across all models may be provided by efficient
dynamic programming techniques, such as Viterbi scoring. The HMM or sequence thereof which indicates the highest
probability of the sequence of features in the test pattern occurring identifies the test pattern.
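As an illustration of Viterbi scoring, the following minimal Python sketch computes the log-probability of the best state path through a single HMM. The initial-state vector `log_pi`, the helper `safe_log`, and all numerical values are assumptions made for this example; the text above names only the matrices A and B.

```python
import math


def viterbi_log_score(log_pi, log_A, log_B):
    """Log-probability of the single best state path through one HMM.

    log_pi[j]   : log probability of starting in state j (assumed here;
                  the description names only the matrices A and B)
    log_A[i][j] : log transition probability from state i to state j
    log_B[t][j] : log probability of the feature vector at frame t in state j
    """
    n_states = len(log_pi)
    # Best score of any path that ends in state j after the first frame.
    delta = [log_pi[j] + log_B[0][j] for j in range(n_states)]
    for t in range(1, len(log_B)):
        # Extend every best path by one frame and keep the best predecessor.
        delta = [
            max(delta[i] + log_A[i][j] for i in range(n_states)) + log_B[t][j]
            for j in range(n_states)
        ]
    return max(delta)


def safe_log(p):
    """log(p) that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else -math.inf


# Toy two-state, two-frame example with made-up probabilities.
log_pi = [safe_log(1.0), safe_log(0.0)]
log_A = [[safe_log(0.5), safe_log(0.5)], [safe_log(0.0), safe_log(1.0)]]
log_B = [[safe_log(0.9), safe_log(0.1)], [safe_log(0.2), safe_log(0.8)]]
best_path_probability = math.exp(viterbi_log_score(log_pi, log_A, log_B))
```

In a recognizer, a score of this kind would be computed for each candidate model or model sequence, and the highest-scoring candidate would identify the test pattern.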

A major hurdle in building successful speech recognition systems is non-uniformity in performance thereof across 
a variety of conditions. Many successful compensation and normalization techniques have been proposed in an attempt 
to deal with differing sources of non-uniformity in performance. Some examples of typical sources of non-uniformity in 
performance in telecommunications applications of speech recognition include inter-speaker, channel, environmental 
and transducer variability, and various types of acoustic mismatch. 

Model adaptation techniques have been used to improve the match during testing (i.e., during recognition of
unknown speech) between a set of unknown utterances and the hidden Markov models (HMMs) in the recognizer
database. Some model adaptation techniques involve applying a linear transformation to the HMMs. The parameters of
such a linear transformation can be estimated using a maximum likelihood criterion, and then the transformation is
applied to the parameters of the HMMs. A perplexing problem not heretofore solved is the existence of speakers in a
population for whom speech recognition performance does not improve after model adaptation using such linear
transformation techniques. This is especially true for unsupervised, single utterance-based adaptation scenarios.

It is generally thought that only those distributions in the HMMs that are likely to have been generated (during train- 
ing) by the unknown utterance have a chance to be mapped more closely to the target speaker with such linear trans- 
formation model adaptation techniques. Therefore, if the "match" between the HMMs and the unknown utterance is not 
reasonably "good" to begin with and the number of unknown utterances is limited (such as for example in a single utter- 
ance-based adaptation scenario), then the utterance cannot "pull" the model to better match the target speaker in such 
conventional model adaptation techniques. Thus, there exists a subset of utterances for which model adaptation does 
not improve speech recognition performance. 

Frequency warping for speaker normalization has been applied to telephone-based speech recognition applications.
In previous testing practice, frequency warping for speaker normalization has been implemented by estimating a
frequency warping function that is applied to the unknown input utterance so that the warped unknown utterance is better
matched to the given HMMs. As is the case for model adaptation, there exists a subset of utterances for which
frequency warping does not improve speech recognition performance.

The frequency warping approach to speaker normalization compensates mainly for inter-speaker vocal tract length
variability by warping of the frequency axis (i.e., applying a frequency transformation to the frequency axis in the
front-end) by a factor $\alpha$, where $\alpha = 1.00$ corresponds to no warping (no frequency transformation).

The front-end of a conventional speech recognizer processes samples of an unknown speech signal. The samples
are obtained from recording windows of a specified duration (e.g., 10 ms) on the unknown speech signal, and such
windows may overlap. The samples of the unknown speech signal are processed using a fast Fourier transform (FFT)
component. The output of the FFT component is further processed and coupled to a mel-scale filterbank (which is also
referred to as a mel-cepstrum filterbank). The mel-scale filterbank is a series of overlapping bandpass filters.
The bandpass filters in the series have a spacing and bandwidth which increases with frequency along the frequency
axis. The output of the mel-scale filterbank is a spectral envelope. An additional transformation to the spectral envelope
provides a sequence of feature vectors, X, characterizing the unknown speech signal.

In previous practice, frequency warping has been implemented in the mel-scale filterbank of the front-end of the
speech recognizer, with the warping factor corresponding to the amount of linear scaling of the spacing and bandwidth
of the filters within the mel-scale filterbank. Scaling the mel-scale filterbank in the front-end is equivalent to
resampling the spectral envelope using a compressed or expanded frequency range. Changes in the spectral envelope are
directly correlated to variations in vocal tract length.

In frequency warping for speaker normalization according to previous practice, an ensemble of warping factors is
provided, each warping factor being an index corresponding to a particular amount of linear scaling, and thus, to a
particular spacing and bandwidth of the filters within the mel-scale filterbank. For each utterance, the optimal warping
factor $\hat{\alpha}$ is selected from a discrete ensemble of possible values so that the likelihood of the warped utterance is
maximized with respect to a given HMM and a given transcription (i.e., a hypothesis of what the unknown speech is). The
values of the warping factors in the ensemble typically vary over a range corresponding to frequency compression or
expansion of approximately ten percent. The size of the ensemble is typically ten to fifteen discrete values.
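The linear scaling of a mel-scale filterbank can be sketched as follows. This is a rough illustration under stated assumptions, not the front-end described above: the mel formula (2595 log10(1 + f/700)), the function names, the number of filters, and the band limits are all chosen here for the example. Whether a factor above 1.00 corresponds to compression or expansion is not fixed by this sketch.

```python
import math


def hz_to_mel(f_hz):
    # A commonly used mel-scale formula; the description does not specify one.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_edges(n_filters, f_low, f_high):
    """Edge frequencies (Hz) of n_filters overlapping filters equally spaced in mel."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + k * step) for k in range(n_filters + 2)]


def warped_filter_edges(edges, alpha):
    """Linearly scale spacing and bandwidth by alpha (alpha = 1.0 means no warping)."""
    return [alpha * f for f in edges]


def warp_ensemble(k, max_deviation):
    """k equally spaced warping factors in [1 - max_deviation, 1 + max_deviation]."""
    return [1.0 - max_deviation + 2.0 * max_deviation * i / (k - 1) for i in range(k)]


edges = filter_edges(4, 0.0, 4000.0)       # hypothetical telephone-band setup
ensemble = warp_ensemble(11, 0.10)         # eleven factors, about ten percent each way
warped_banks = {alpha: warped_filter_edges(edges, alpha) for alpha in ensemble}
```

Each entry of `warped_banks` plays the role of one filterbank in the ensemble, indexed by its warping factor.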

Let $X^{\alpha} = g_{\alpha}(X)$ denote the sequence of cepstral observation vectors (i.e., the sequence of feature vectors), where
each observation vector (i.e., each feature vector) is warped by the function $g_{\alpha}(\cdot)$, and the warping is assumed to be
linear. If $\lambda$ denotes the set of HMMs and the parameters thereof, the optimal warping factor is defined as:

$$\hat{\alpha} = \arg\max_{\alpha} P(X^{\alpha} \mid \alpha, \lambda, H) \tag{1}$$



where H is the transcription (i.e., a decoded string) obtained from an initial recognition pass using the unwarped
sequence of feature vectors X. This frequency warping technique is computationally efficient since maximizing the
likelihood in Eq. 1 involves only the forced probabilistic alignment of the warped observation vectors $X^{\alpha}$ to a single string
H. Finally, the frequency-warped sequence of feature vectors $X^{\hat{\alpha}}$ is used in a second recognition pass to obtain the final
recognition result.

Frequency warping for speaker normalization according to previous practice transforms an utterance according to
a parametric transformation, $g_{\alpha}(\cdot)$, in order to maximize the likelihood criterion given in Eq. 1.
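The selection rule of Eq. 1 amounts to a search over a small discrete set of warping factors. The sketch below is a toy illustration only: features are scalars, `warp_features` simply multiplies them by the factor as a stand-in for re-running the front-end with a scaled filterbank, and the forced alignment to the string H is represented by fixed per-frame means and variances. All names and values are hypothetical.

```python
import math


def gaussian_log_likelihood(frames, means, variances):
    """Sum of scalar Gaussian log-densities along a fixed (forced) alignment."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        for x, mu, var in zip(frames, means, variances)
    )


def select_warp_factor(warp_factors, warp_features, log_likelihood):
    """Return (alpha_hat, best_score) maximizing the likelihood over a discrete ensemble."""
    best_alpha, best_score = None, -math.inf
    for alpha in warp_factors:
        score = log_likelihood(warp_features(alpha))
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


# Toy data: the aligned model means are 10% larger than the unwarped features.
unwarped = [1.0, 2.0]
aligned_means = [1.1, 2.2]
aligned_variances = [1.0, 1.0]

alpha_hat, best_score = select_warp_factor(
    [0.9, 1.0, 1.1],
    lambda alpha: [alpha * x for x in unwarped],  # stand-in for a warped front-end
    lambda frames: gaussian_log_likelihood(frames, aligned_means, aligned_variances),
)
```

Because only one hypothesized string is aligned per factor, the cost grows linearly with the size of the ensemble, which is consistent with the efficiency argument made above.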

There is a large class of maximum likelihood-based model adaptation procedures that can be described as parametric
transformations of the HMMs. For these procedures, let $\lambda_{\gamma} = h_{\gamma}(\lambda)$ denote the models obtained by a parametric
linear transformation $h_{\gamma}(\cdot)$ of the original set of HMMs and parameters thereof. The form of the parametric linear
transformation depends on the nature of the sources of non-uniformity in speech recognition performance and the
amount of data (represented by the sequence of feature vectors) available for estimating the parameters of the
transformation.

A maximum likelihood criterion similar to that used for estimating $\alpha$ is used for estimating $\gamma$:

$$\hat{\gamma} = \arg\max_{\gamma} P(X \mid \gamma, \lambda, H) \tag{2}$$



The article by McDonough et al. entitled "An Approach To Speaker Adaptation Based On Analytic Functions," Proc.
Intl. Conf. on Acoustics, Speech and Signal Processing, Atlanta, GA, May 1996, pages 721-724, suggests that in an
HMM-based speech recognition system the effect of frequency warping for speaker normalization is equivalent to that
obtained from either a linear transformation applied to the cepstral feature space or a linear transformation applied to
the means of the HMMs. In view of that reference, frequency warping for speaker normalization and a linear
transformation in the cepstral feature space are the same, and would therefore have equivalent effects. McDonough et al.
teaches that combining frequency warping for speaker normalization and a linear transformation in the cepstral feature
space is redundant.

SUMMARY OF THE INVENTION 

We have discovered that the ineffectiveness of frequency warping for speaker normalization for a particular subset 
of utterances is due to the interaction of various sources of variability in speech recognition performance in the process 
of estimating the "best" warping function. If both model adaptation and frequency warping for speaker normalization are 
limited by the initial relationship between the HMMs and the input utterance, then the solution to this problem is to 
search for an optimal warping function and an optimal model transformation in the same procedure.

Since a spoken utterance may be simultaneously affected by many sources of speech recognition performance
variability, and since there may be many acoustic correlates associated with a given source of variability, it is important
that different procedures for compensating for acoustic distortions be tightly coupled with one another. According to the
principles of the invention, linear model transformation and frequency warping are implemented as a single combined
procedure in an HMM-based speech recognition system to compensate for these sources of speech recognition
performance variability.

In an illustrative embodiment of the invention, unknown speech in the form of sound waves is received by an acous- 
tic transducer and is converted into an electrical unknown speech signal. An optimally warped sequence of feature vec- 
tors is determined which characterizes the unknown speech signal, each of the feature vectors in the sequence being 
warped according to an optimal warping factor. Recognition unit models stored in a memory are adapted to the 
unknown speech signal. A plurality of the adapted recognition unit models is compared with the optimally warped 
sequence of feature vectors to determine a comparison score for each such model. The highest comparison score is 
selected and the unknown speech is recognized based on the highest score.

Combining frequency warping of the cepstrum observation sequence and linear transformation of the recognition 
unit models improves speech recognition performance substantially more than when using either of these techniques 
alone, contrary to the teachings of "An Approach To Speaker Adaptation Based On Analytic Functions" by McDonough 
et al. The combined procedure according to the principles of the invention successfully compensates for mismatches 
between speaker populations used for training and speaker populations encountered during testing in HMM-based
speech recognition. 

Other aspects and advantages of the invention will become apparent from the following detailed description and
accompanying drawing, illustrating by way of example the features of the invention. 

BRIEF DESCRIPTION OF THE DRAWING

In the drawing: 



FIG. 1 is a schematic view of a speech recognition system in accordance with the principles of the invention;
FIG. 2 is a schematic view of a bank of mel-scale filterbanks for use in the speech recognition system depicted in
FIG. 1;
FIG. 3 is a flow diagram for describing frequency warping in accordance with the principles of the invention;
FIG. 4 is a flow diagram for describing model adaptation in accordance with the principles of the invention; and
FIG. 5 is a flow diagram for describing a joint optimization process in accordance with the principles of the
invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

For a better understanding of the invention, together with other and further objects, advantages, and capabilities 
thereof, reference is made to the following disclosure and the figures of the drawing, where like reference characters 
designate like or similar elements. 

For clarity of explanation, the illustrative embodiments of the present invention are presented as comprising indi- 
vidual functional blocks (including functional blocks labeled as "processors"). The functions these blocks represent may 
be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of
executing software. For example, the functions of processors presented in the figures of the drawing may be provided 
by a single shared processor. (Use of the term "processor" should not be construed to refer exclusively to hardware 
capable of executing software.) 

Illustrative embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or
DSP32C, read-only memory (ROM) for storing software performing the operations discussed below [...]. Very large
scale integration (VLSI) hardware embodiments, as well as [...].

[...] speech signal 16. A [...] feature extractor 18, and a recognizer 24. The recognizer 24 [...] the HMMs (i.e., the means
[...] embodiment of the invention, the model adaptation processor 22 adapts the [...] unknown speech signal 16
[...] be simultaneously estimated so that:

$$\{\hat{\alpha}, \hat{\gamma}\} = \arg\max_{\alpha, \gamma} P(X^{\alpha} \mid \alpha, \gamma, \lambda, H) \tag{3}$$

[...] sequence of feature vectors. Such combined process can be described as follows:

$$h_{\gamma}(X^{\alpha}(t)) = X^{\alpha}(t) - \gamma \tag{4}$$
where $X^{\alpha}(t)$ is the cepstrum feature vector at time instant $t$ warped by the warping function $g_{\alpha}(\cdot)$.

To estimate the optimal model adaptation parameter, which in this specific example is a single linear bias $\gamma$ applied
[...] sequence of feature vectors, it was assumed that only the highest scoring Gaussian in the mixture
contributes to the likelihood computation. The estimate is thus simplified as follows:

$$\hat{\gamma} = \frac{\sum_{t} \left( X^{\alpha}(t) - \mu_{j}(t) \right) / \sigma_{j}^{2}(t)}{\sum_{t} 1 / \sigma_{j}^{2}(t)} \tag{5}$$

where $\mu_{j}(t)$ and $\sigma_{j}^{2}(t)$ are the mean and variance of the most active Gaussian $j$ in the mixture at time instant $t$.

A frequency warping technique which can be implemented in the feature extractor is illustrated by the
flow diagram of FIG. 3. K warped sequences of feature vectors are generated in step 38, wherein the K warped
sequences are each warped by a warping factor, $\alpha$, which is in a range from 0.88 to 1.12, where a warping factor of 1.00
corresponds to no warping. Warping corresponds to a frequency transformation of the [...].

In an initial recognition pass, the unwarped sequence of feature vectors is scored against the set of recognition unit
models [...] the unwarped sequence of feature vectors with the recognition unit models.

The K warped sequences of feature vectors (including one sequence warped with a warping factor of 1.00) are
[...] determining the likelihood of the warped sequence given the [...]
hypothesized string in order to make a set of likelihoods. The maximum of the set of likelihoods is selected [...]
warping [...] optimally warped sequence of feature vectors, in which the
feature vectors are each warped [...].

The model adaptation processor 22 (FIG. 1) receives a sequence of feature vectors 20 as input and affects the
HMMs [...] the recognition database 12. An exemplary model adaptation technique which can be implemented
in the model adaptation processor 22 is illustrated by the flow diagram of FIG. 4. The model adaptation [...].

Referring to FIG. 4, in an initial recognition pass the sequence of feature vectors is scored against the recognition
unit models [...] with the hypothesized string is determined in step 52. The optimal model adaptation parameter is then
[...] hypothesized string. Assuming that only the highest scoring Gaussian in the mixture contributes to
[...], the optimal model adaptation parameter is estimated in step 54 [...].

The model adaptation parameter defined by Eq. 5 reflects a linear transformation to the means of the model [...]
vectors. [...]

Because of the simplicity of the linear transformation (i.e., a single linear bias) applied to the models, the inverse of
such transformation, in the form of a single linear bias, can be applied directly to the input sequence of feature vectors
[...]. Thus, in a specific embodiment of the invention, the optimal model adaptation parameter $\gamma$ is
used to adapt the recognition unit models to the unknown input utterance (characterized by the input sequence of
feature vectors) by applying the inverse of the transformation corresponding to $\gamma$ to the [...] sequence of feature vectors. This
inverse transformation is somewhat more easily implemented.
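A minimal sketch of the bias estimate of Eq. 5, together with the inverse transformation applied to the features rather than to the model means, is given below. It assumes diagonal covariances, treats each cepstral dimension independently, and takes the per-frame means and variances of the highest scoring Gaussian as already available from an alignment; the function names and numbers are hypothetical.

```python
def estimate_bias(frames, means, variances):
    """Per-dimension bias: variance-weighted average of (x - mu) over all frames.

    frames[t], means[t], variances[t] are equal-length lists holding the feature
    vector at frame t and the mean and variance of the Gaussian assumed to be the
    highest scoring one at that frame.
    """
    dim = len(frames[0])
    numerator = [0.0] * dim
    denominator = [0.0] * dim
    for x, mu, var in zip(frames, means, variances):
        for d in range(dim):
            numerator[d] += (x[d] - mu[d]) / var[d]
            denominator[d] += 1.0 / var[d]
    return [n / d for n, d in zip(numerator, denominator)]


def remove_bias(frames, bias):
    """Apply the inverse transformation to the features: subtract the bias from every frame."""
    return [[x_d - b_d for x_d, b_d in zip(x, bias)] for x in frames]


# Toy two-frame, two-dimensional example with made-up values.
frames = [[1.0, 2.0], [3.0, 2.0]]
means = [[0.0, 0.0], [1.0, 1.0]]
variances = [[1.0, 1.0], [1.0, 2.0]]

bias = estimate_bias(frames, means, variances)
compensated = remove_bias(frames, bias)
```

Subtracting the bias from the features leaves the stored models untouched, which is one way to read the remark that the inverse transformation is more easily implemented.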

In a second recognition pass, speech is recognized by scoring the linearly biased sequence of feature vectors in a
probabilistic alignment with the set of HMMs to generate a recognized speech signal.

FIG. 5 presents a flow diagram for describing joint optimization of the parameters of frequency warping and model
adaptation in a maximum likelihood framework [...] embodiment of the invention. The form of the procedure for
[...] and spectral shaping is dictated largely by the fact that [...]
an optimal warping function from a relatively [...] ensemble of possible warping functions. [...]
sequence of warped feature vectors $X^{\alpha_i}$ can be generated for each warping factor $\alpha_i$, for i = 1 to K, and
[...] of feature vectors [...] the closed form expression given in Eq. 5 to compute
[...]. The optimal parameters $(\hat{\alpha}, \hat{\gamma})$ are then chosen based on the likelihood of the
warped utterance with respect to the most likely transformed model. In obtaining the optimal warping factor and [...]
(i) [...] speech and an ensemble of linear warping factors $\alpha_1, \ldots, \alpha_K$, [...]

$$X^{\alpha_1}, \ldots, X^{\alpha_K} \tag{6}$$

(ii) N-best Recognition Hypotheses: Generate in step 60 the N most likely word string hypotheses by performing
an initial decoding pass for each of the frequency warped utterances.

$$H_i^1, \ldots, H_i^N, \quad \text{for } i = 1, \ldots, K \tag{7}$$

A conventional N-best string model generator [...] derive a "best" string model and generate a set
of N string models which are highly [...] string model. The N highly competitive string models
provide a basis for generating N [...]. The N-best string model
generator receives hidden Markov model [...] from the recognition database and produces a set of
string models which are highly competitive [...] recognition unit models that best match the input
sequence of feature vectors. Determination of [...] made through use of DSP implementation
of a modified Viterbi decoder. An N-best [...] used in accordance with the principles
[...] AND [...] STRING MO[...],
which is incorporated by reference as [...].

(iii) [...] N-best string hypotheses $H_i^n$, for n = 1, ..., N, obtained [...]

$$\gamma_i, \quad \text{for } i = 1, \ldots, K \tag{8}$$

as given by Eq. 5 in order to increase the likelihood

$$P(X^{\alpha_i} \mid \alpha_i, \gamma_i, \lambda, H_i^n), \quad \text{for } i = 1, \ldots, K \tag{9}$$

(iv) Select Warping Factor: Given $\gamma_i$ obtained for each $\alpha_i$, i = 1, ..., K, compute in step 64

$$(\hat{\alpha}, \hat{\gamma}) = \arg\max P(X^{\alpha_i} \mid \alpha_i, \gamma_i, \lambda, H_i^n) \tag{10}$$
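The joint procedure of steps (i) through (iv) can be sketched as a nested search: for every warping factor and every competing hypothesis, estimate a bias in closed form, apply its inverse to the warped features, and keep the combination with the highest likelihood. The sketch below is a toy illustration under several assumptions: features are scalars, `warp_features`, `n_best`, and `align` are hypothetical stand-ins for the front-end, the N-best decoder, and the forced alignment, and maximizing over both the warping index and the hypothesis index is one reading of Eq. 10.

```python
import math


def estimate_bias(frames, means, variances):
    """Scalar form of the closed-form bias estimate (variance-weighted mean offset)."""
    numerator = sum((x - mu) / var for x, mu, var in zip(frames, means, variances))
    denominator = sum(1.0 / var for var in variances)
    return numerator / denominator


def log_likelihood(frames, means, variances):
    """Sum of scalar Gaussian log-densities along a fixed alignment."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        for x, mu, var in zip(frames, means, variances)
    )


def joint_search(warp_factors, warp_features, n_best, align):
    """Return (score, alpha, gamma, hypothesis) with the highest likelihood."""
    best = None
    for alpha in warp_factors:                      # step (i): warped utterances
        frames = warp_features(alpha)
        for hypothesis in n_best(frames):           # step (ii): competing strings
            means, variances = align(frames, hypothesis)
            gamma = estimate_bias(frames, means, variances)   # step (iii)
            shifted = [x - gamma for x in frames]   # inverse bias on the features
            score = log_likelihood(shifted, means, variances)
            if best is None or score > best[0]:     # step (iv): keep the maximum
                best = (score, alpha, gamma, hypothesis)
    return best


# Toy setup: two hypothetical strings with fixed per-frame means, unit variances.
unwarped = [1.0, 2.0]
templates = {"a": [2.1, 3.2], "b": [0.0, 5.0]}

best = joint_search(
    [0.9, 1.0, 1.1],
    lambda alpha: [alpha * x for x in unwarped],
    lambda frames: ["a", "b"],
    lambda frames, hypothesis: (templates[hypothesis], [1.0, 1.0]),
)
```

In this toy case neither warping alone nor a bias alone makes the features match string "a" exactly, while the pair found by the joint search does, which mirrors the motivation for estimating both parameters in one procedure.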
[...] consisted of 8802 [...] 4304 utterances (13185 [...]

unknown utterances to be recognized [...] hidden Markov digit models with [...]

TABLE 1

ADAPTATION METHOD          DATA ERROR
Baseline + Warp Trained    2.9%
Warp                       2.5%
Bias                       2.5%
Warp + Bias                2.2%
N-Best + Warp + Bias       2.1%
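The relative error reductions implied by the rows of TABLE 1 can be checked with a few lines of arithmetic. The sketch below takes the "Baseline + Warp Trained" row as the reference, which is an assumption made here about the reference point for the reduction of roughly 30% stated in this document.

```python
# Error rates (percent) copied from the rows of TABLE 1.
error_rates = {
    "Baseline + Warp Trained": 2.9,
    "Warp": 2.5,
    "Bias": 2.5,
    "Warp + Bias": 2.2,
    "N-Best + Warp + Bias": 2.1,
}


def relative_reduction(reference, value):
    """Fractional reduction of `value` relative to `reference`."""
    return (reference - value) / reference


reference = error_rates["Baseline + Warp Trained"]  # assumed reference row
reductions = {
    method: relative_reduction(reference, rate)
    for method, rate in error_rates.items()
    if method != "Baseline + Warp Trained"
}
```

With these values, warping alone and bias alone each give about a 13.8% relative reduction, the combination gives about 24.1%, and adding N-best hypotheses gives about 27.6%.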
[...]. The second baseline experiment, [...] during
training. In such training process [...] a warping factor is estimated for each
speaker in the training set [...] transcription [...] the speaker's
utterances) corresponding to [...]. Then [...] feature vectors [...].
Finally, HMM models [...] utterances using the segmentation information
obtained from the warped utterances. A 15% [...]
using warping during training. The "Warp Trained" [...] used for the remainder of the adaptation experiments
reported in TABLE 1.

[...] preferably one or more hidden
[...] (i.e., before actual speech recognition in the field).
Any conventional training [...] in accordance with the principles of the
invention. Such training can be iterative [...]. [...] model training is described in detail in
U.S. Patent No. 5,579,[?]36 [...] entitled "[...] UNIT [...] TRAINING
BASED ON COMPETING WORD [...]", [...] incorporated by reference herein.

Next, TABLE 1 shows the performance of the speaker adaptation [...] utterance [...].

"Warp" refers to frequency warping such as described previously, in which the [...] from 12% [...]
to 12% expansion [...]. The fourth row of TABLE 1, "Bias", displays
the recognition rate when a single [...] bias [...] without the use of frequency
warping. The optimal bias vector maximizes [...] transcription (i.e., hypothesis
string) obtained from a preliminary decoding [...].

[...] improvement of the speech recognition system
compared to separately [...] optimization of both model
transformation and frequency warping [...] attain a match between unknown utterances and
the recognition unit models.

[...] and bias estimation were applied to the top
four scoring transcriptions [...] provides a larger
ensemble of models [...] the recognition error rate can be reduced by
approximately 30%. Most of the improvement [...] and bias adaptation
techniques, while some [...] to using N-best [...] hypotheses in the process of estimating the
transformation parameters. The reduction in error rate [...] combining frequency warping and spectral shaping in a
single process is approximately equal to [...] recognition error rates [...] each
of the adaptation [...].

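[...] invention. Where technical features mentioned in any claim are followed by reference signs, those reference
signs have been included for the sole purpose of increasing the intelligibility of the claims and accordingly, such
reference signs do not have any limiting effect on the scope of each element identified by way of example by such
reference signs.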

Claims 



1. A signal processing method for recognizing unknown speech signals, comprising the following steps:

(A) receiving an unknown speech signal;

(B) generating a warped sequence of feature vectors characterizing the unknown speech signal; 

(C) adapting a set of recognition unit models to the unknown speech signal; and 

(D) recognizing the unknown speech signal based on the warped sequence of feature vectors and the set of 
adapted recognition unit models. 

2. A signal processing method for recognizing unknown speech signals, comprising the following steps:

(A) receiving an unknown speech signal; 

(B) generating a warped sequence of feature vectors characterizing the unknown speech signal; 

(C) applying a linear transformation to the warped sequence of feature vectors; and 

(D) recognizing the unknown speech signal based on the linearly transformed warped sequence of feature 
vectors and the set of recognition unit models. 

3. A method as defined in claims 1 or 2, wherein:

the set of recognition unit models comprising one or more hidden Markov models.

4. A method as defined in claims 1 or 2, wherein step (B) comprises the step of:

providing a bank of mel-scale filterbanks, wherein 

each of the bank of mel-scale filterbanks having a particular spacing and bandwidth of the filters within the
mel-scale filterbank corresponding to an amount of frequency transformation and being associated with a particular
warping factor, and

at least one of the bank of mel-scale filterbanks corresponding to no frequency transformation and being asso- 
ciated with a warping factor of one. 

5. A method as defined in claim 3, when dependent on claim 1, wherein step (B) further comprises the steps of:

determining an unwarped sequence of feature vectors using the at least one mel-scale filterbank corresponding
to no frequency transformation and being associated with a warping factor of one;

determining a hypothesized string based on a probabilistic alignment of the unwarped sequence of feature 
vectors with respect to the set of recognition unit models; 

determining one or more warped sequences of feature vectors, the feature vectors of each warped sequence 
being warped according to a different warping factor; 

determining, for each of the one or more warped sequences of feature vectors, the likelihood of the warped 
sequence of feature vectors with respect to one or more recognition unit models associated with the hypothe- 
sized string to make a set of likelihoods; 
determining the maximum of the set of likelihoods; 

selecting an optimal warping factor based on the maximum of the set of likelihoods; and 

identifying, based on the optimal warping factor, the warped sequence of feature vectors characterizing the 

unknown speech signal. 

6. A method as defined in claims 1 or 2, further comprising the step of:

providing a memory means for storing the set of recognition unit models and a matrix of recognition unit model 
parameters, wherein 

each recognition unit model including one or more Gaussian distributions,

each of the one or more Gaussian distributions having a mean and a variance, and

the matrix of recognition unit model parameters comprising the mean and the variance of each Gaussian dis- 
tribution. 

7. A method as defined in claim 6, when dependent on claim 1, wherein step (C) comprises the step of:

adjusting one or more recognition unit model parameters. 

8. A method as defined in claim 6, wherein step (C) comprises the steps of:

determining a hypothesized string based on a probabilistic alignment of the warped sequence of feature vectors
with respect to the set of recognition unit models; and

determining a model adaptation parameter based on the warped sequence of feature vectors and the mean 
and the variance of a Gaussian distribution of a recognition unit model associated with the hypothesized string; 

wherein, when dependent on claim 2, the linear transformation being based on the model adaptation
parameter.

9. A method as defined in claim 7, wherein step (C) further comprises the step of: 

shifting the mean of each Gaussian distribution in the set of recognition unit models based on the model adap- 
tation parameter. 

10. A speech recognition system, comprising: 

an acoustic transducer capable of receiving sound waves representing unknown speech and converting the 
sound waves into an electrical unknown speech signal; 

a feature extractor coupled to the acoustic transducer, wherein the feature extractor generating a warped 
sequence of feature vectors based on the unknown speech signal, the feature vectors of the warped sequence 
being warped according to a warping factor; 
a memory means for storing a set of recognition unit models; 

a model adaptation processor coupled to the feature extractor and the memory means, wherein the model 
adaptation processor adapting the set of recognition unit models to the unknown speech signal based on the 
warped sequence of feature vectors; and 

a recognizer coupled to the feature extractor and the memory means, wherein the recognizer recognizing the 
unknown speech signal based on the warped sequence of feature vectors and the set of adapted recognition 
unit models. 

11. A system as defined in claim 10, wherein:

the set of recognition unit models comprising one or more hidden Markov models. 

12. A system as defined in claim 10, wherein:

the feature extractor including a bank of mel-scale filterbanks,

each of the bank of mel-scale filterbanks having a particular spacing and bandwidth of the filters within the
mel-scale filterbank corresponding to an amount of frequency transformation and being associated with a particular
warping factor, and

at least one of the bank of mel-scale filterbanks corresponding to no frequency transformation and being asso- 
ciated with a warping factor of one. 

13. A system as defined in claim 12, wherein: 

the feature extractor operating to 

determine an unwarped sequence of feature vectors, which characterizes the unknown speech signal, using 
the at least one mel-scale filterbank corresponding to no frequency transformation and being associated with
a warping factor of one, 

determine a hypothesized string based on a probabilistic alignment of the unwarped sequence of feature vectors
with respect to the set of recognition unit models,

determine one or more warped sequences of feature vectors, each being warped according to a different warping
factor, characterizing the unknown speech signal,

determine, for each of the one or more warped sequences of feature vectors, the likelihood of the warped 
sequence of feature vectors with respect to one or more recognition unit models associated with the hypothe- 
sized string to make a set of likelihoods, 
determine the maximum of the set of likelihoods, 

select the optimal warping factor based on the maximum of the set of likelihoods, and 

identify, based on the optimal warping factor, the warped sequence of feature vectors characterizing the 

unknown speech signal. 






EP 0 866 442 A2 



14. A system as defined in claim 10, wherein:

each recognition unit model including one or more Gaussian distributions,

each of the one or more Gaussian distributions having a mean and a variance, and further comprising
a matrix of recognition unit model parameters comprising the mean and the variance of each Gaussian
distribution.

15. A system as defined in claim 14, wherein:

the model adaptation processor operating to determine a hypothesized string based on a probabilistic alignment
of the warped sequence of feature vectors with respect to the set of recognition unit models, and
determine a model adaptation parameter based on the warped sequence of feature vectors and the mean and
the variance of a Gaussian distribution of a recognition unit model associated with the hypothesized string,

wherein: the model adaptation parameter in particular corresponds to a shift in the mean of each Gaussian
distribution in the set of recognition unit models.